The objective of this second assignment is to explore the differentially expressed genes from the cleaned and normalized data in assignmnent one. Then rank the thresholded over-representation analysis to highlight the top terms / dominant themes in the top set of genes. Lastly, compare my result with the original literature and find some other supports as well for my result if possible. I make some changes to my assignment one and stored the file as “ammended_A1.Rmd” and imported it here. I will demonstrate briefly what I have changed in general workflow in A1. My summary of the paper can be found at my journal or in A1.
(Ritchie et al. 2015) (Robinson, McCarthy, and Smyth 2010) (Rainer 2017) (Xie 2020) (Gu, Eils, and Schlesner 2016) (Gu et al. 2014) (Kolberg and Raudvere 2019)
My data is from GSE84054 (Goh et al. 2017)
What I have changed in A1: I decided to leave the genes that have duplicated names because Ruth sugguested that they are a very small number compare to my total number of genes, ~1%. So I deleted many steps in A1, and made the cleaning step clearer by showing boxplots, density plots and MDS plots in each cleaning step.
Loaded my normalized data from A1.
normalized_count_data <- read.table(file="GSE84054_normalized_count.txt")
kable(normalized_count_data[1:5, 1:5], type="html")
| GENEID | SYMBOL | RHB037 | RHB038 | RHB041 | |
|---|---|---|---|---|---|
| ENSG00000000003 | ENSG00000000003 | TSPAN6 | 79.809196 | 63.375225 | 37.719858 |
| ENSG00000000419 | ENSG00000000419 | DPM1 | 36.461054 | 38.451386 | 49.614654 |
| ENSG00000000457 | ENSG00000000457 | SCYL3 | 12.060195 | 12.077110 | 6.329049 |
| ENSG00000000460 | ENSG00000000460 | C1orf112 | 6.762435 | 4.972927 | 3.848316 |
| ENSG00000000938 | ENSG00000000938 | FGR | 1.402348 | 3.966502 | 1.940060 |
We need to explore the data again after normalization to ensure the normalized data reaches our expectations.
# After normalization
data2plot_after <- log2(normalized_count_data[,3:ncol(normalized_count_data)])
{boxplot(data2plot_after, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Filtered and normalized RNASeq Samples")
abline(h = median(apply(data2plot_after, 2, median)), col = "green", lwd = 0.6, lty = "dashed")}
counts_density <- apply(log2(normalized_count_data[,3:ncol(normalized_count_data)]), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM",cex.lab = 0.85,
main = "Filtered and normalized RNASeq Samples distribution")
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(data2plot), col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4", merge = TRUE, bg = "gray90")
heatmap_matrix <- normalized_count_data[,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GENEID
colnames(heatmap_matrix) <- rownames(samples_filtered)
# MDS plot by "cell_type" in samples
plotMDS(heatmap_matrix, labels=rownames(samples_filtered),
col = c("darkgreen","blue")[factor(samples_filtered$cell_type)],
main = "MDS plot depending on cell type")
pat_colors <- rainbow(12)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
# MDS plot by "cell_type" + "patients"in samples
plotMDS(heatmap_matrix, col = pat_colors,
main = "MDS plot depending on both cell type and patients")
Based on the two models in part1, I decide to base my model only on “cell_type”
model_design <- model.matrix(~ samples$cell_type)
kable(model_design, type="html")
| (Intercept) | samples$cell_typeSphere |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
There are 6988 genes that pass the p-value = 0.05 which is chosen based on the paper.
expressionMatrix <- as.matrix(normalized_count_data[,3:ncol(normalized_count_data)])
rownames(expressionMatrix) <- normalized_count_data$GENEID
colnames(expressionMatrix) <- colnames(normalized_count_data)[3:ncol(normalized_count_data)]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
# fit
fit <- lmFit(minimalSet, model_design)
# Use Bayes
fit2 <- eBayes(fit,trend=TRUE)
# Correction: BH (recommended by the paper)
topfit <- topTable(fit2, coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
# merge gene symbol to topfit table
output_hits <- merge(normalized_count_data[,1:2], topfit, by.y = 0, by.x = 1, all.y=TRUE)
#sorted by P-value
output_hits <- output_hits[order(output_hits$P.Value),]
kable(output_hits[1:5,],type="html")
| GENEID | SYMBOL | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 16973 | ENSG00000235865 | GSN-AS1 | -4.9824524 | 2.8883230 | -10.370167 | 0 | 6.20e-06 | 5.229345 |
| 3787 | ENSG00000113575 | PPP2CA | 101.3281947 | 88.8886109 | 9.056963 | 0 | 4.02e-05 | 4.390996 |
| 19346 | ENSG00000267904 | CTC-429P9.5 | -0.8873553 | 0.6414619 | -8.675083 | 0 | 5.57e-05 | 4.108874 |
| 3158 | ENSG00000108091 | CCDC6 | 78.2685270 | 64.3139803 | 8.563287 | 0 | 5.57e-05 | 4.022672 |
| 6795 | ENSG00000138600 | SPPL2A | 22.4430356 | 22.1442964 | 8.151324 | 0 | 9.54e-05 | 3.690225 |
# number of genes that pass threshold p-value = 0.05
length(which(output_hits$P.Value < 0.05)) # 6988
# number of genes that pass correction
length(which(output_hits$adj.P.Val < 0.05)) # 4062
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)
I have shown that my data is suitable for using edgeR for further analysis. The data follows the binomial distribution.
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE,
show.binned.common.disp.vars = TRUE,
main = "Binomial distribution of my data")
The individual dots represent each gene and the blue line is the overall trend line.
plotBCV(d,col.tagwise = "black",col.common = "red",
main = "BCV plot of RNA-seq data")
I used Quasi-likelihood models to fit my data and used QLFTest to test for differential expression. The Quasi-likelihood compares two conditions (primary tumour and tumoursphere) and shows the up and down-regulated genes. The result below that are sorted by p-value. I also inspected the number of genes that satisty my threshold and correction. I choose to use FDR correction based on the paper as well(Goh et al. 2017) . There are 7467 genes pass the p-value = 0.05, and 5033 genes that pass the FDR correction.
fit <- glmQLFit(d, model_design)
qlf.sphere_vs_tumour <- glmQLFTest(fit, coef='samples$cell_typeSphere')
kable(topTags(qlf.sphere_vs_tumour), type="html")
| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| ENSG00000133574 | -10.864552 | 4.3344790 | 211.3416 | 0 | 0 |
| ENSG00000106483 | -12.504407 | 5.1136480 | 184.1976 | 0 | 0 |
| ENSG00000179144 | -10.957441 | 4.7967197 | 159.9882 | 0 | 0 |
| ENSG00000163563 | -9.261757 | 3.3981949 | 156.9675 | 0 | 0 |
| ENSG00000130300 | -12.870252 | 5.0189032 | 153.4504 | 0 | 0 |
| ENSG00000211668 | -11.278848 | 3.4432853 | 146.3944 | 0 | 0 |
| ENSG00000147113 | -10.742034 | 3.3588694 | 141.1675 | 0 | 0 |
| ENSG00000167286 | -10.447530 | 2.6237953 | 140.2040 | 0 | 0 |
| ENSG00000120279 | -9.829420 | 2.4408378 | 138.9715 | 0 | 0 |
| ENSG00000106952 | -7.721763 | -0.0200685 | 138.2581 | 0 | 0 |
| x |
|---|
| BH |
| x |
|---|
| samples$cell_typeSphere |
| x |
|---|
| glm |
# Get all the results
qlf_output_hits <- topTags(qlf.sphere_vs_tumour,
sort.by = "PValue",
n = nrow(normalized_count_data))
# Number of genes that pass the threshold p-value = 0.05
length(which(qlf_output_hits$table$PValue < 0.05)) # 7467
# Number of genes that pass correction
length(which(qlf_output_hits$table$FDR < 0.05)) # 5033
I determined the number of up-regulated genes by selecting every gene that does not pass my p-value: 0.05, and also have a positive log fold change. Down-regulated genes are selected in the same way with a negative log fold change. Stored these data for later enrichment analysis on gProfileR.
# number of genes that are up regulated
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)) # 1897
# number of genes that are down regulated
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)) # 5570
# Get those up and down-regulated genes
qlf_output_hits_withgn <- merge(expr_filtered[,1:2],qlf_output_hits, by.x=1, by.y = 0)
upregulated_genes <- qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)]
downregulated_genes <-qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)]
# store data - all differentially expressed
unreg_genes_copy <- data.frame(upregulated_genes)
downreg_genes_copy <- data.frame(downregulated_genes)
names(unreg_genes_copy) <- names(downreg_genes_copy)
all_de <- rbind(unreg_genes_copy, downreg_genes_copy)
colnames(all_de) <- "all_de"
write.table(x=all_de,
file="all_expr_de_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
# up regulated
write.table(x=upregulated_genes,
file="expr_upregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
# down regulated
write.table(x=downregulated_genes,
file="expr_downregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
I have shown the up and down-regulated genes in a volcano plot by coloring them in red and blue, the code is from (Annick Moisan, n.d.)
volcanoData <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(volcanoData) <- c("logFC", "Pval")
up <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 0
point.col <- ifelse(up, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Up-regulated genes in RNA-seq data")
down <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(down, "blue", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Down-regulated genes in RNA-seq data")
To test the differential expression, I used the heatmap and it has shown a clear distinction between up and down regulated genes. There is a clear difference between the primary tumour samples and tumoursphere samples.(They are reversed.) The clustering is very obvious to show that differential expression exists.
top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits),pattern = "_P"),
grep(colnames(heatmap_matrix_tophits),pattern = "_S"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,)
current_heatmap
There were no REAC found. I used the gprofiler2’s function to query data and also attached the screenshots that I took on their website since the package does not show the number of gene sets each has found.
The some analysis is apply to down regulated
upregulated_genes_sym <- qlf_output_hits_withgn$SYMBOL[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 1)]
length(upregulated_genes_sym) # 1312
upregulated_genes_sym[grep(pattern="ALDH",upregulated_genes_sym)]
# ALDH2, ALDH8A1, ALDH1L2 -> confirmed!
Annick Moisan, Nathalie Villa-Vialaneix, Ignacio Gonzales. n.d. “Practical Statistical Analysis of Rna-Seq Data - edgeR - Tomato Data.”
Goh, Jian Yuan, Min Feng, Wenyu Wang, Gokce Oguz, Siti Maryam JM Yatim, Puay Leng Lee, Yi Bao, et al. 2017. “Chromosome 1q21.3 Amplification Is a Trackable Biomarker and Actionable Target for Breast Cancer Recurrence.” Nature Medicine 23 (11). Nature Publishing Group: 1319.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2.
Kolberg, Liis, and Uku Raudvere. 2019. Gprofiler2: Interface to the ’G:Profiler’ Toolset. https://CRAN.R-project.org/package=gprofiler2.
Rainer, Johannes. 2017. EnsDb.Hsapiens.v75: Ensembl Based Annotation Package.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Robinson, M D, D J McCarthy, and G K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Vasiliou, Vasilis, and Daniel W Nebert. 2005. “Analysis and Update of the Human Aldehyde Dehydrogenase (Aldh) Gene Family.” Human Genomics 2 (2). BioMed Central: 138.
Vassalli, Giuseppe. 2019. “Aldehyde Dehydrogenases: Not Just Markers, but Functional Regulators of Stem Cells.” Stem Cells International 2019. Hindawi.
Xie, Yihui. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.